CopyKit Workshop
Overview
The goal of CopyKit is to help you analyze single cell DNA sequencing datasets for copy number.
This tutorial presents an example workflow of CopyKit. Detailed information for methods used of each function can be found in CopyKit user guide.
Installation
Install the stable or development version from github using
devtools. Note that knn smoothing is only available in development version.Install from package archive files, download
v0.1.1here.
devtools::install_github("navinlabcode/copykit")
devtools::install_github("navinlabcode/copykit@devel")Check version and load libraries.
packageVersion("copykit")## [1] '0.1.1'
library(copykit)Parallelization
Whenever possible, CopyKit uses the BiocParallel framework.
library(BiocParallel)
register(MulticoreParam(workers = 50, progressbar = F), default = T)Confirm parameters:
bpparam()## class: MulticoreParam
## bpisup: FALSE; bpnworkers: 50; bptasks: 0; bpjobname: BPJOB
## bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
## bpRNGseed: ; bptimeout: 2592000; bpprogressbar: FALSE
## bpexportglobals: TRUE; bpforceGC: TRUE
## bplogdir: NA
## bpresultdir: NA
## cluster type: FORK
Segmentation
Run CBS segmentation from bam files.
This part takes more than 10 mins depending on the sample size thus will be skipped in the workshop.
tumor <- runVarbin(
dir = "~/path/to/bam/files/",
genome = "hg38",
resolution = "220kb", # c("220kb", "55kb", "110kb", "195kb", "280kb", "500kb", "1Mb", "2.8Mb")
remove_Y = FALSE,
is_paired_end = FALSE,
seed = 17,
BPPARAM = bpparam()
)An example of basic segmentation and metric calculation of sequencing data from mock data generated from the built-in simulator.
copykit_obj <- mock_bincounts(ncells = 5,
ncells_diploid = 1,
position_gain = c(1,5,1000))## Smoothing bin counts.
## Running segmentation algorithm: CBS for genome hg38
## Merging levels.
## Done.
copykit_obj <- runMetrics(copykit_obj)## Calculating overdispersion.
## Counting breakpoints.
## Done.
Calculated metric was stored in colData and can be
visualized using plotMetrics().
kable(colData(copykit_obj))| sample | ground_truth | overdispersion | breakpoint_count | |
|---|---|---|---|---|
| V1 | V1 | FALSE | 0.0002988 | 0 |
| V2 | V2 | TRUE | 0.0002742 | 0 |
| V3 | V3 | TRUE | 0.0002272 | 0 |
| V4 | V4 | TRUE | 0.0001224 | 0 |
| V5 | V5 | TRUE | 0.0001481 | 0 |
plotMetrics(copykit_obj, metric = "overdispersion", label = "ground_truth")Example data set
First we load in the example CopyKit object. This data set was generated from scDNA-seq of a human liver sample.
tumor <- readRDS("sample_obj.rds")Then take a look at the data structure of CopyKit object.
head(tumor)## class: CopyKit
## dim: 6 600
## metadata(3): genome resolution vst
## assays(6): bincounts ft ... ratios logr
## rownames(6): 1 2 ... 5 6
## rowData names(3): gc_content abspos arm
## colnames(600): PMTC6LiverC103DL6L7S3_S1255_L004_R1_001
## PMTC6LiverC204DL1S7_S588_L002_R1_001 ...
## PMTC6LiverC271AL6L7S5_S1423_L004_R1_001
## PMTC6LiverC136AL6L7S3_S1288_L004_R1_001
## colData names(9): sample reads_assigned_bins ... reads_total
## percentage_duplicates
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowRanges has: 6 ranges
## Phylo: Phylogenetic tree with 0 tips and nodes
## consensus dim: 0 0
kable(rowData(tumor)[1:5,])| gc_content | abspos | arm |
|---|---|---|
| 0.613786 | 1039596 | p |
| 0.591817 | 1264158 | p |
| 0.543128 | 1830107 | p |
| 0.572288 | 2065015 | p |
| 0.604703 | 2290536 | p |
kable(colData(tumor)[1:5,])| sample | reads_assigned_bins | reads_unmapped | reads_duplicates | reads_multimapped | reads_unassigned | reads_ambiguous | reads_total | percentage_duplicates | |
|---|---|---|---|---|---|---|---|---|---|
| PMTC6LiverC103DL6L7S3_S1255_L004_R1_001 | PMTC6LiverC103DL6L7S3_S1255_L004_R1_001 | 506994 | 27749 | 67923 | 0 | 105388 | 115 | 708169 | 0.096 |
| PMTC6LiverC204DL1S7_S588_L002_R1_001 | PMTC6LiverC204DL1S7_S588_L002_R1_001 | 476881 | 36334 | 45716 | 0 | 98424 | 96 | 657451 | 0.070 |
| PMTC6LiverC258AL1S3_S258_L001_R1_001 | PMTC6LiverC258AL1S3_S258_L001_R1_001 | 549322 | 27843 | 50278 | 0 | 111227 | 107 | 738777 | 0.068 |
| PMTC6LiverC154AL1S2_S154_L001_R1_001 | PMTC6LiverC154AL1S2_S154_L001_R1_001 | 812987 | 36743 | 78314 | 0 | 155468 | 179 | 1083691 | 0.072 |
| PMTC6LiverC247AL4L5S3_S1015_L003_R1_001 | PMTC6LiverC247AL4L5S3_S1015_L003_R1_001 | 624401 | 35599 | 74604 | 0 | 131717 | 111 | 866432 | 0.086 |
kable(segment_ratios(tumor)[1:5,1:5])| PMTC6LiverC103DL6L7S3_S1255_L004_R1_001 | PMTC6LiverC204DL1S7_S588_L002_R1_001 | PMTC6LiverC258AL1S3_S258_L001_R1_001 | PMTC6LiverC154AL1S2_S154_L001_R1_001 | PMTC6LiverC247AL4L5S3_S1015_L003_R1_001 |
|---|---|---|---|---|
| 1 | 1 | 0.98 | 0.71 | 1 |
| 1 | 1 | 0.98 | 0.71 | 1 |
| 1 | 1 | 0.98 | 0.71 | 1 |
| 1 | 1 | 0.98 | 0.71 | 1 |
| 1 | 1 | 0.98 | 0.71 | 1 |
Quality Control
First we want to detect euploid cells by calculating the sample-wise
coefficient of variation from the segment ratio means. The threshold can
be changed from the automatic detection to a custom threshold with the
argument resolution. By setting a threshold = 0.1,
findAneuploidCells() will mark as euploid all cells with a
coefficient of variation less or equal than 0.1.
tumor <- findAneuploidCells(tumor, seed = 17)## number of iterations= 15
## Copykit detected 244 that are possibly diploid cells using a resolution of: 0.067
## Added information to colData(CopyKit).
Then we detect low-quality samples by a k-nearest-neighbors filtering (default k=5). Cells in which the correlation value is lower than the defined threshold are classified as low-quality cells (default resolution=0.9).
tumor <- findOutliers(tumor, k = 5, resolution = 0.9)## Calculating correlation matrix.
## Marked 70 cells as outliers.
## Adding information to metadata. Access with colData(scCNA).
## Done.
kable(colData(tumor)[1:5,])| sample | reads_assigned_bins | reads_unmapped | reads_duplicates | reads_multimapped | reads_unassigned | reads_ambiguous | reads_total | percentage_duplicates | is_aneuploid | find_normal_cv | cell_corr_value | outlier | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PMTC6LiverC103DL6L7S3_S1255_L004_R1_001 | PMTC6LiverC103DL6L7S3_S1255_L004_R1_001 | 506994 | 27749 | 67923 | 0 | 105388 | 115 | 708169 | 0.096 | FALSE | 0.02 | 1.000 | FALSE |
| PMTC6LiverC204DL1S7_S588_L002_R1_001 | PMTC6LiverC204DL1S7_S588_L002_R1_001 | 476881 | 36334 | 45716 | 0 | 98424 | 96 | 657451 | 0.070 | FALSE | 0.00 | 1.000 | FALSE |
| PMTC6LiverC258AL1S3_S258_L001_R1_001 | PMTC6LiverC258AL1S3_S258_L001_R1_001 | 549322 | 27843 | 50278 | 0 | 111227 | 107 | 738777 | 0.068 | FALSE | 0.06 | 0.678 | TRUE |
| PMTC6LiverC154AL1S2_S154_L001_R1_001 | PMTC6LiverC154AL1S2_S154_L001_R1_001 | 812987 | 36743 | 78314 | 0 | 155468 | 179 | 1083691 | 0.072 | TRUE | 0.54 | 0.972 | FALSE |
| PMTC6LiverC247AL4L5S3_S1015_L003_R1_001 | PMTC6LiverC247AL4L5S3_S1015_L003_R1_001 | 624401 | 35599 | 74604 | 0 | 131717 | 111 | 866432 | 0.086 | FALSE | 0.00 | 1.000 | FALSE |
Visualize the QC results by a heatmap.
plotHeatmap(tumor,
order_cells = "hclust",
label = c('is_aneuploid','outlier'),
row_split = 'outlier',
n_threads = 40)In the last step of the QC we remove unwanted cells by sub-setting the CopyKit object.
tumor <- tumor[,colData(tumor)$is_aneuploid == TRUE]
tumor <- tumor[,colData(tumor)$outlier == FALSE]The dataset should now be ready to proceed with the analysis.
Clustering
We apply dimension reduction and embed the data into 2 dimensional space.
tumor <- runUmap(tumor, seed = 17)
kable(reducedDim(tumor)[1:5,])| PMTC6LiverC154AL1S2_S154_L001_R1_001 | -2.9639254 | 0.8397497 |
| PMTC6LiverC72AL3S1_S840_L003_R1_001 | -2.0085797 | 0.8467716 |
| PMTC6LiverC223AL1S3_S223_L001_R1_001 | -0.3869732 | -0.9430873 |
| PMTC6LiverC153AL1S6_S537_L002_R1_001 | -2.6216934 | 0.8785209 |
| PMTC6LiverC198DL1S7_S582_L002_R1_001 | -1.9747129 | -0.0633721 |
plotUmap(tumor)By default, CopyKit uses hdbscan
as clustering method and it is recommended to do so. To perform
clustering, k should be pre-determined To help with
choosing k, CopyKit provides the helper function
findSuggestedK() to bootstrap clustering over a range of k
values, and returns the value that maximizes the jaccard similarity.
tumor <- findSuggestedK(tumor, k_range = 5:9)## Calculating jaccard similarity for k range: 5 6 7 8 9
##
## Suggested k = 7 with median jaccard similarity of: 0.951
plotSuggestedK(tumor)plotSuggestedK(tumor, geom = 'tile')Finally we cluster the cells using hdbscan with preferred k.
tumor <- findClusters(tumor)## Using suggested k_subclones = 7
## Finding clusters, using method: hdbscan
## Found 2 outliers cells (group 'c0')
## Found 4 subclones.
## Done.
Check the clustering result.
tumor <- calcConsensus(tumor);tumor <- runConsensusPhylo(tumor)
plotUmap(tumor, label = 'subclones')plotHeatmap(tumor,
label = 'subclones',
n_threads = 40)It is possible that a subgroup of clustering outliers is identified. Outliers are added to subgroup c0 and should be removed.
tumor_c0 <- tumor[,colData(tumor)$subclones == 'c0']
tumor <- tumor[,colData(tumor)$subclones != 'c0']
colData(tumor)$subclones <- droplevels(colData(tumor)$subclones)Excluded data can be easily merged back. This can also be used to merge 2 CopyKit objects.
merged <- cbind(tumor, tumor_c0)Ploidy and Integer estimation
CopyKit supports different methods of calculating integer copy number profiles.
Computational ploidies: scquantum is a method that estimates ploidy, and uses the ploidy estimate to scale the data to absolute copy numbers given bincount data from single-cell copy number profiling.
Fixed: a fixed value of ploidy (generally determined using Flow Cytometry) will be used to scale all cells.
tumor <- calcInteger(tumor,
method = 'fixed',
ploidy_value = 4.3)Plot the integer copy number and check the rounding error
plotHeatmap(tumor,
assay = 'integer',
label = 'subclones',
n_threads = 40)plotHeatmap(tumor,
assay = 'integer',
rounding_error = TRUE,
label = 'subclones',
n_threads = 40)
We can also check profile of any cell using the
plotRatio(). Here shown profile of a cell from c1.
plotRatio(tumor, sample_name = 'PMTC6LiverC281AL6L7S5_S1433_L004_R1_001')Phylogeny
To run a phylogenetic analysis of cells’ copy number profiles, use
the function runPhylo(). Available methods are Neighbor Joining and
Balanced Minimum
evolution. Trees can also be visualized by
plotPhylo()
tumor <- runPhylo(tumor,
assay = 'segment_ratios',
metric = 'manhattan',
n_threads = 40)
phylo(tumor)##
## Phylogenetic tree with 311 tips and 310 internal nodes.
##
## Tip labels:
## PMTC6LiverC154AL1S2_S154_L001_R1_001, PMTC6LiverC72AL3S1_S840_L003_R1_001, PMTC6LiverC223AL1S3_S223_L001_R1_001, PMTC6LiverC153AL1S6_S537_L002_R1_001, PMTC6LiverC198DL1S7_S582_L002_R1_001, PMTC6LiverC161AL4L5S1_S929_L003_R1_001, ...
##
## Rooted; includes branch lengths.
plotPhylo(tumor, label = 'subclones')
Single cell phylogeny is very prone to data noise. By taking the median
of all cells from the same subclone, we can build subclonal consensus
profiles and construct phylogeny from there. We can also root the
consensus tree by a diploid profile.
tumor <- calcConsensus(tumor, assay = 'segment_ratios')
kable(consensus(tumor)[1:5,])| c1 | c2 | c3 | c4 | |
|---|---|---|---|---|
| V1 | 0.69 | 0.73 | 0.68 | 0.69 |
| V2 | 0.69 | 0.73 | 0.68 | 0.69 |
| V3 | 0.69 | 0.73 | 0.68 | 0.69 |
| V4 | 0.69 | 0.73 | 0.68 | 0.69 |
| V5 | 0.69 | 0.73 | 0.68 | 0.69 |
tumor <- runConsensusPhylo(tumor, root = 'neutral')
consensusPhylo(tumor)##
## Phylogenetic tree with 4 tips and 3 internal nodes.
##
## Tip labels:
## c1, c2, c3, c4
##
## Rooted; includes branch lengths.
plotPhylo(tumor, label = 'subclones', consensus = TRUE)To further understand the what copy number events contribute to the lineage, we can also plot consensus heatmaps annotated by oncogenes.
genes = c("CDKN2A",
"FGFR1",
"TP53",
"PTEN",
"MYC",
"CDKN1A",
"MDM2",
"AURKA",
"PIK3CA",
"CCND1",
"KRAS")
tumor <- calcConsensus(tumor, assay = 'integer')
plotHeatmap(tumor,
consensus = TRUE,
label = 'subclones',
genes = genes,
n_threads = 40)Or more intuitively, check copy number states across of genes in different subclones.
plotGeneCopy(tumor,
genes = genes,
label = 'subclones',
dodge.width = 0.8)Also note that all the plot functions except heatmap are based on
ggplot2, thus we can further customize them by adding
themes, etc..
plotGeneCopy(tumor,
genes = genes,
label = 'subclones',
dodge.width = 0.8) +
ggplot2::ylim(c(0,4)) +
ggplot2::ggtitle("Copy number status across genes") +
ggplot2::theme_linedraw()Session info
sessionInfo()## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Red Hat Enterprise Linux
##
## Matrix products: default
## BLAS/LAPACK: /usr/lib64/libopenblasp-r0.3.3.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] BiocParallel_1.28.3 copykit_0.1.1
## [3] DNAcopy_1.68.0 Rsubread_2.8.1
## [5] SingleCellExperiment_1.16.0 SummarizedExperiment_1.24.0
## [7] Biobase_2.54.0 GenomicRanges_1.46.1
## [9] GenomeInfoDb_1.30.1 IRanges_2.28.0
## [11] S4Vectors_0.32.3 BiocGenerics_0.40.0
## [13] MatrixGenerics_1.6.0 matrixStats_0.61.0
## [15] knitr_1.40
##
## loaded via a namespace (and not attached):
## [1] circlize_0.4.14 igraph_1.2.11 lazyeval_0.2.2
## [4] splines_4.1.2 ggplot2_3.3.5 amap_0.8-18
## [7] digest_0.6.29 foreach_1.5.2 yulab.utils_0.0.4
## [10] htmltools_0.5.2 magick_2.7.3 viridis_0.6.2
## [13] ggalluvial_0.12.3 fansi_1.0.2 magrittr_2.0.2
## [16] cluster_2.1.2 doParallel_1.0.17 mixtools_1.2.0
## [19] fastcluster_1.2.3 ComplexHeatmap_2.10.0 rmdformats_1.0.4
## [22] colorspace_2.0-3 xfun_0.30 dplyr_1.0.8
## [25] crayon_1.5.0 RCurl_1.98-1.6 jsonlite_1.8.0
## [28] survival_3.3-0 iterators_1.0.14 ape_5.6-2
## [31] glue_1.6.2 gtable_0.3.0 zlibbioc_1.40.0
## [34] XVector_0.34.0 GetoptLong_1.0.5 DelayedArray_0.20.0
## [37] kernlab_0.9-29 shape_1.4.6 prabclus_2.3-2
## [40] DEoptimR_1.0-10 scales_1.1.1 DBI_1.1.2
## [43] miniUI_0.1.1.1 Rcpp_1.0.8.1 viridisLite_0.4.0
## [46] xtable_1.8-4 clue_0.3-60 gridGraphics_0.5-1
## [49] tidytree_0.3.8 mclust_5.4.9 FNN_1.1.3
## [52] RColorBrewer_1.1-3 fpc_2.2-9 modeltools_0.2-23
## [55] ellipsis_0.3.2 farver_2.1.0 pkgconfig_2.0.3
## [58] flexmix_2.3-17 nnet_7.3-17 sass_0.4.0
## [61] uwot_0.1.11 utf8_1.2.2 labeling_0.4.2
## [64] ggplotify_0.1.0 tidyselect_1.1.2 rlang_1.0.1
## [67] later_1.3.0 munsell_0.5.0 tools_4.1.2
## [70] cli_3.2.0 dbscan_1.1-10 generics_0.1.2
## [73] evaluate_0.15 stringr_1.4.0 fastmap_1.1.0
## [76] yaml_2.3.5 ggtree_3.2.1 robustbase_0.93-9
## [79] purrr_0.3.4 nlme_3.1-155 mime_0.12
## [82] aplot_0.1.2 compiler_4.1.2 rstudioapi_0.13
## [85] beeswarm_0.4.0 png_0.1-7 treeio_1.18.1
## [88] tibble_3.1.6 bslib_0.3.1 stringi_1.7.6
## [91] highr_0.9 RSpectra_0.16-0 forcats_0.5.1
## [94] lattice_0.20-45 bluster_1.4.0 Matrix_1.4-0
## [97] vctrs_0.3.8 pillar_1.7.0 lifecycle_1.0.1
## [100] jquerylib_0.1.4 GlobalOptions_0.1.2 BiocNeighbors_1.12.0
## [103] bitops_1.0-7 httpuv_1.6.5 patchwork_1.1.1
## [106] R6_2.5.1 bookdown_0.29 promises_1.2.0.1
## [109] gridExtra_2.3 vipor_0.4.5 codetools_0.2-18
## [112] MASS_7.3-55 gtools_3.9.2 assertthat_0.2.1
## [115] rjson_0.2.21 withr_2.4.3 GenomeInfoDbData_1.2.7
## [118] diptest_0.76-0 parallel_4.1.2 grid_4.1.2
## [121] ggfun_0.0.5 tidyr_1.2.0 class_7.3-20
## [124] rmarkdown_2.16 segmented_1.4-0 ggnewscale_0.4.6
## [127] shiny_1.7.1 ggbeeswarm_0.6.0